What effect do industry and location have on the amount of startup investment?

Introduction:

In this research paper, I explore the relationship between a startup's industry and location and the total amount of investment it receives. One might say that the idea behind a startup is the main reason an investor chooses or rejects it. However, as we have seen many times in the past, great ideas can still get abandoned. In this paper, I look at two factors, industry and location, and ask whether there is any relationship between these factors and the amount of funding that a startup receives.

The data used for this research can be found on Kaggle. The study focuses primarily on American startups, as it is harder to find complete data for other countries. Please note that the words sector, category, and industry are used interchangeably in this study. For location we use the state code, so location and state are also used interchangeably.

I chose these independent variables to see how factors other than the idea itself can affect the funding a startup receives. I chose location because I think it is an important factor in determining a startup's success and thus its funding. California's Silicon Valley is known to be one of the largest hubs for startups, with many startups competing for space there. California is also home to some of the finest institutions, such as Stanford and Caltech, which are known to produce a large volume of startups each year. After California, some of the biggest startup hubs in the US are New York, Washington (mainly Seattle), and Texas. I chose my second variable, sector, because of the many buzzwords we hear in the news on a daily basis. I want to see whether certain industries are "hot" amongst investors and whether opening a startup in a "bad" industry can be a death sentence for it.

Data Set Link: https://www.kaggle.com/justinas/startup-investments

Project 1

Since our study focuses on companies, we want to drop any values that represent

Summary Statistics

Let's interpret the summary statistics for our Y variable. We can see that total funding in USD has a mean of 6.5 million dollars. You may be inclined to think that startups in the US must be enjoying a great amount of success with a mean that high. However, let's have a look at the median. A median of 0 dollars tells us that at least half of the startups in the dataset received no funding. We have to be careful here, as the mean can be misleading because of its sensitivity to outliers.
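The gap between the mean and the median can be illustrated with a small sketch. The numbers below are made up for illustration and are not taken from the Kaggle data; only the column name funding_total_usd comes from our dataset.

```python
import pandas as pd

# Hypothetical funding values in USD (NOT the real data): most startups
# receive nothing, and one receives a very large amount.
toy = pd.DataFrame(
    {"funding_total_usd": [0, 0, 0, 0, 50_000, 2_000_000, 60_000_000]}
)

mean_funding = toy["funding_total_usd"].mean()
median_funding = toy["funding_total_usd"].median()
# The mean of a boolean Series is the share of True values.
share_zero = (toy["funding_total_usd"] == 0).mean()

print(f"mean:   {mean_funding:,.0f}")
print(f"median: {median_funding:,.0f}")
print(f"share with zero funding: {share_zero:.0%}")
```

A single large value pulls the mean into the millions while the median stays at zero, which is the same pattern we see in the summary statistics.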

From the summary of our other X variable (state), we can see that California has the highest number of startups in the United States, more than 15 thousand. Note that the data contains 51 states because Washington, D.C. is also treated as a state in our data.

This data has been log-transformed to better understand our results. As we can see in the histogram above, the funding of the vast majority of startups lies in the 0 to one billion range. We will break this range down further and create a new histogram to analyze the results in more detail.
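As a rough sketch of what a log transform does to skewed funding data, the snippet below uses invented values and assumes a log(1 + x) transform so that zero-funding startups stay defined. The exact transform used for the histogram above may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical, heavily right-skewed funding values in USD (NOT the real data).
funding = pd.Series(
    [0, 0, 0, 10_000, 250_000, 5_000_000, 900_000_000],
    name="funding_total_usd",
)

# log1p(x) = log(1 + x): zero funding maps to 0 instead of -infinity.
log_funding = np.log1p(funding)

# Bin the transformed values; these counts are what a histogram would draw.
counts, bin_edges = np.histogram(log_funding, bins=5)
print(counts)
print(bin_edges.round(2))
```

On the raw scale the single 900 million value would squeeze every other startup into the first bin; on the log scale the values spread across the bins.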

We continue to notice the same trend even as we focus on a certain range. The distribution of the data remains skewed to the right.

Moving on to our X variables, let's look at their frequencies. I will use bar charts here, as both variables are categorical.

Confirming what we saw in the summary statistics, California has the highest number of startups by a huge margin over New York, which is in second place. This is likely because California is home to Silicon Valley and to many large startup companies. Now let's plot the number of startups in each industry.

The software industry has the highest number of startups of any industry. However, the margin is not as large as the one we saw earlier when we plotted the number of startups per state. Another thing to note is that the "Other" category also holds a large proportion of startups, ranking in 3rd place. Government and pets appear to have the smallest number of startups.
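The counts behind both bar charts can be sketched with value_counts. The rows below are invented, and the column names state_code and category_code are assumptions about how the dataset labels location and industry.

```python
import pandas as pd

# Hypothetical rows (NOT the real data); column names are assumed.
toy = pd.DataFrame({
    "state_code": ["CA", "CA", "CA", "NY", "NY", "TX"],
    "category_code": ["software", "web", "software", "software", "other", "web"],
})

# value_counts returns one count per category, sorted from largest to smallest.
startups_per_state = toy["state_code"].value_counts()
startups_per_industry = toy["category_code"].value_counts()

print(startups_per_state)
print(startups_per_industry)
# With matplotlib installed, startups_per_state.plot(kind="bar") draws the chart.
```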

We saw earlier from the histograms of our Y variable (funding_total_usd) that the data is extremely skewed to the right. To understand those histograms better, I want to see what proportion of startups received no funding. We will do this by creating a groupby object and plotting the percentage of startups that received no funding, per industry and per location separately.
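A minimal sketch of this groupby step is shown below. The data is invented, and the column names category_code and state_code are assumptions; funding_total_usd is the Y variable from our dataset.

```python
import pandas as pd

# Hypothetical rows (NOT the real data).
toy = pd.DataFrame({
    "category_code": ["software", "software", "software", "nanotech", "nanotech"],
    "state_code": ["CA", "NY", "CA", "CA", "MA"],
    "funding_total_usd": [0, 0, 1_000_000, 5_000_000, 0],
})

# Flag startups with no funding, then average the flag within each group:
# the mean of a boolean column is the share of True values.
toy["no_funding"] = toy["funding_total_usd"] == 0
pct_no_funding_by_industry = toy.groupby("category_code")["no_funding"].mean() * 100
pct_no_funding_by_state = toy.groupby("state_code")["no_funding"].mean() * 100

print(pct_no_funding_by_industry.round(1))
print(pct_no_funding_by_state.round(1))
```

Computing a share rather than a raw count keeps industries and states of very different sizes comparable.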

Again, consistent with the figure above, this figure shows the enormous percentage of startups with no funding, grouped by industry. Nanotech startups appear to have the lowest share of startups that received no funding (less than 20%). This may be because the nanotech industry contains significantly fewer startups than, for example, the software industry.

We can see that the highest median funding belongs to the nanotechnology sector. This is likely due to the expensive nature of the industry, which requires large amounts of capital and state-of-the-art technology.
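The median funding per sector can be computed along the following lines. The values are invented for illustration, and category_code is an assumed column name.

```python
import pandas as pd

# Hypothetical rows (NOT the real data).
toy = pd.DataFrame({
    "category_code": ["software", "software", "software",
                      "nanotech", "nanotech", "nanotech"],
    "funding_total_usd": [0, 100_000, 40_000_000,
                          2_000_000, 3_000_000, 9_000_000],
})

# Median funding within each sector, largest first.
median_by_industry = (
    toy.groupby("category_code")["funding_total_usd"]
    .median()
    .sort_values(ascending=False)
)
print(median_by_industry)
```

In this toy data, software has the single largest deal but the lower median, which shows why the median is the safer summary for data this skewed.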

Summary

Next Steps

For the next study, I will likely run a multiple regression on our Y variable (funding_total_usd) to better understand this relationship. I will also try to figure out a way to integrate the funding rounds variable and see how it affects total funding. Lastly, I will try to merge in other datasets to see whether there are other variables that would help my analysis. My goal is to create a strong model for determining the investment a startup receives.

Project 2

The Message

As I mentioned earlier, the purpose of this study is to see what effect, if any, a startup's location and sector have on its success, which we measure as the funding it receives.

We are going to run a linear regression with total funding as the outcome variable y and location as a dummy variable to further understand the effect of location on startup funding.
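As a minimal sketch of such a regression, the snippet below fits funding on a single 0/1 dummy for one state using NumPy's least-squares polynomial fit. The data is invented, and fitting one state at a time is an assumption about the setup. With a single dummy, the intercept equals the mean funding outside the state and the slope equals the difference in mean funding between the two groups.

```python
import numpy as np
import pandas as pd

# Hypothetical rows (NOT the real data); funding is in millions of USD.
toy = pd.DataFrame({
    "state_code": ["CA", "CA", "NY", "TX", "MA", "NY"],
    "funding_total_usd": [10.0, 6.0, 2.0, 1.0, 5.0, 4.0],
})

# Dummy variable: 1 if the startup is in California, 0 otherwise.
is_ca = (toy["state_code"] == "CA").astype(float).to_numpy()
y = toy["funding_total_usd"].to_numpy()

# Ordinary least squares fit of y = intercept + slope * is_ca.
slope, intercept = np.polyfit(is_ca, y, deg=1)

mean_ca = y[is_ca == 1].mean()
mean_other = y[is_ca == 0].mean()
print(f"slope={slope:.2f}, difference in means={mean_ca - mean_other:.2f}")
print(f"intercept={intercept:.2f}, mean outside CA={mean_other:.2f}")
```

A positive slope therefore only says that the state's average funding is above the average of all other states combined, and a negative slope says the opposite.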

Here we ran a simple linear regression with location as the independent variable and total funding as the y variable. The x axis represents a dummy variable where 1 means the startup is in that state and 0 means it is not. Our linear regression model predicts the funding that a startup is likely to receive based on which state it is in. For simplicity, I plotted California, Massachusetts, and New York on the last two graphs at the bottom. We can see some interesting results from this graph. Firstly, as confirmed by our previous methods, startups in California and Massachusetts do receive more funding; that is, being located in CA or MA seems to have a positive effect on the amount of funding received compared with not being located there.

The surprising result is that being located in New York or Texas appears to have an adverse effect on the amount of funding received, so based on this result those locations are disadvantageous to startups. In the other graph we see some even more surprising results. Being located in Alabama is good for your startup! I'm not sure whether this is due to some error or whether it is a significant finding. We also noticed that Washington, Maryland, and Utah seem to be favorable locations for startups.

Other states have not been labeled, as the majority of them seem to have an adverse effect on startup investments. This may be related to something we looked at previously: the majority of startups do not receive any funding, so the state itself may not be entirely accountable for the poor funding.

Here we can see the number of startups per state on a map of the US. Note that Hawaii and Alaska are not shown because the size of the map did not allow them to be included without disrupting the other states.

Here we can see the mean funding for the different states. Consistent with our regression result, Alabama's name comes up once again. This map gives us a good picture of the average startup funding in each state.

Conclusion

Once again we have performed some more tests to check the relationship between startup funding and location. From our evidence so far, we can suggest that location has some effect on startup funding, although we cannot be entirely sure due to hidden variables. For the next project we will dive deeper and perhaps try a multiple regression along with some other techniques for visualizing our results.

Project 3

Intro

As I want to see what effect location has on the success of a startup, I also want to see whether my findings so far are consistent with other sources and research. In this section we check whether the information from our previous experiments agrees with an independent source. I will look at a CNBC article that ranks all the states by an overall score for starting a business in the United States. The overall rank is determined by 85 metrics in 10 broad categories such as infrastructure, economy, etc. The rankings are from 2021, and the dataset also includes a ranking of how easy or hard it is to secure funding in each state. I believe this is important information that is missing from our original data and can help us learn more about our dataset, particularly why some states have an enormous number of startups compared to others. I will scrape this data from the website below and store it in a dataframe. The scraped data can then be merged with the original startup dataset to see whether startups in general seem to follow the advice in the article. I will measure this by looking at the number of startups in each state (from our original dataset) and comparing it to the access to capital in that state and its overall ranking (from the article). I am not using an API for the scraping, so we only need to scrape this data once.

The data source I am using is easy to scrape, as it is a web page containing a table similar to what we've worked with in class.

Link to article: https://www.cnbc.com/2021/07/13/americas-top-states-for-business.html
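To show why an HTML table is easy to scrape, here is a standard-library sketch that pulls rows out of a table. It is not necessarily the method used in this project, and the HTML snippet, state names, and column headers are placeholders, not content from the CNBC page.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._row = None      # cells of the row currently being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row[-1] += data


# Placeholder HTML standing in for a downloaded page.
html = """
<table>
  <tr><th>State</th><th>Overall</th></tr>
  <tr><td>Exampleland</td><td>1</td></tr>
  <tr><td>Samplevania</td><td>2</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
header, *body = parser.rows
print(header)
print(body)
```

The header and body lists can then be passed to a dataframe constructor.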

Now we have all the data that we need for our experiment in a dataframe.

Our dataframe contains only state names, not state codes. To merge it with the datasets we've been working with so far, we need state codes. Therefore, I will create a list of state codes in the order of their overall ranking on the CNBC website.
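An alternative to a hand-ordered list, sketched below, is a name-to-code lookup, which does not depend on row order. The column names State and state_code are assumptions, the lookup is deliberately partial, and the row order here is arbitrary rather than the CNBC ranking.

```python
import pandas as pd

# Partial name-to-code lookup; a full version would list every state.
STATE_CODES = {
    "California": "CA",
    "New York": "NY",
    "Texas": "TX",
    "Washington": "WA",
}

# Stand-in for the scraped dataframe (arbitrary row order, assumed column name).
best_states = pd.DataFrame(
    {"State": ["Texas", "Washington", "California", "New York"]}
)
best_states["state_code"] = best_states["State"].map(STATE_CODES)

# Names missing from the lookup become NaN, which makes typos easy to spot.
unmatched = best_states[best_states["state_code"].isna()]
print(best_states)
print(f"unmatched rows: {len(unmatched)}")
```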

First, I am going to create a scatterplot of the data we've put into our dataframe to draw some observations about it. I am mainly interested in figuring out whether there is some relationship between overall rank and access to capital across the states.

As we can see, there is some relationship between access to capital and overall rank; however, the scatter plot above suggests that the relationship is not particularly strong. States such as Nebraska and Idaho rank better overall but do not do so well when it comes to access to capital. Conversely, businesses in California seem to have greater access to funding but do not do so well overall.
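One way to put a number on the strength of this relationship is a rank correlation. The sketch below uses invented ranks and hypothetical column names, not the CNBC values. Because both columns are already ranks without ties, the ordinary Pearson correlation of the two columns equals Spearman's rank correlation.

```python
import pandas as pd

# Hypothetical ranks for six states (NOT the CNBC values).
ranks = pd.DataFrame({
    "overall_rank": [1, 2, 3, 4, 5, 6],
    "capital_rank": [3, 1, 6, 2, 4, 5],
})

# Pearson correlation computed on ranks is Spearman's rho.
rho = ranks["overall_rank"].corr(ranks["capital_rank"])
print(f"rank correlation: {rho:.3f}")
```

A value near 1 would mean the two rankings mostly agree, while a value near 0 would mean overall rank says little about access to capital.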

Now that we have analyzed our extracted data, it is time to merge it with our original data. I will merge the extracted data with the gb_loc dataframe that we created earlier.
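A sketch of the merge is shown below. The dataframes are small stand-ins with invented values, and the columns state_code, n_startups, and overall_rank are assumptions about the structure of gb_loc and the scraped data.

```python
import pandas as pd

# Stand-ins for the real dataframes (invented values, assumed columns).
gb_loc = pd.DataFrame({"state_code": ["CA", "NY", "TX"], "n_startups": [3, 2, 1]})
best_states = pd.DataFrame({"state_code": ["CA", "TX", "WA"], "overall_rank": [3, 1, 2]})

# A left merge keeps every state from gb_loc; indicator=True adds a _merge
# column showing whether a matching row was found in best_states.
merged = gb_loc.merge(best_states, on="state_code", how="left", indicator=True)
print(merged)
```

Checking the _merge column after the merge is a quick way to catch state codes that failed to match.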

We're going to create an interactive map similar to the one we created earlier; however, this time the fill of the map will represent the rankings from the CNBC article instead of the mean funding for each state. For this we will have to merge our best_states dataframe with the state_df dataframe that we downloaded earlier.

Conclusion

We can see some inconsistency between the mean funding across states that we found previously and the overall ranking of the best states to start a business. For example, California, which has the largest number of startups in the United States, has a lower overall ranking. Texas scores high in the overall ranking and also had a relatively high average funding. Another factor that may have caused a difference between the ranking and the mean funding is that the rankings are from 2021, whereas some of the startups in our dataset have been in business for far longer. It is reasonable to assume that economic conditions in each state have changed over the years, making the environment more or less suitable for startups. Our findings so far suggest that being located in certain states is beneficial to your business. Specifically, it seems that Washington and Texas have been better locations for a startup to succeed in recent years.

Link to view interactive maps and plots: https://utoronto-my.sharepoint.com/:u:/g/personal/momair_khan_mail_utoronto_ca/ETNwqu4YAsdFvxsvpI19tWYBr74hEu6c-LoGSjjs4RkwSQ?e=PJwVSw